This assignment is focused on exploring categorical-to-continuous variable relationships and continuous-to-continuous variable relationships. It is not open ended like the last two assignments. There are certain tasks you must complete for all problems, but you will gain experience with the different plot types introduced in the Week 07 recordings. You will practice creating, modifying, interpreting, and communicating insights from them. The last question requires you to visually explore relationships associated with one of the final projects of your choosing.
You must download the 3 data sets provided in the Canvas assignment page and save them to the appropriate directory on your computer.
Type the names of other students you worked with here.
For each of the 3 assigned data sets you must perform the following ESSENTIAL activities:
You do NOT need to display basic descriptive statistics and counts. You will visually explore the variables in each problem.
You will work with the NumPy, Pandas, matplotlib.pyplot, and Seaborn modules in this assignment.
Import NumPy, Pandas, matplotlib.pyplot, and Seaborn using their commonly accepted aliases.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df01=pd.read_csv("hw07_prob_01.csv")
df01
| | x | value |
|---|---|---|
| 0 | A | 10.079154 |
| 1 | B | 10.899639 |
| 2 | C | 2.169568 |
| 3 | D | 5.002276 |
| 4 | A | 6.715621 |
| ... | ... | ... |
| 2795 | D | -0.739087 |
| 2796 | A | 3.368663 |
| 2797 | B | 1.561011 |
| 2798 | C | 0.438753 |
| 2799 | D | 7.749233 |
2800 rows × 2 columns
df01.shape
(2800, 2)
df01.dtypes
x object value float64 dtype: object
df01.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2800 entries, 0 to 2799 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x 2800 non-null object 1 value 2800 non-null float64 dtypes: float64(1), object(1) memory usage: 43.9+ KB
df01.x.value_counts()
A 700 B 700 C 700 D 700 Name: x, dtype: int64
df01.nunique()
x 4 value 2800 dtype: int64
df01.isna().sum()
x 0 value 0 dtype: int64
df01.describe()
| | value |
|---|---|
| count | 2800.000000 |
| mean | 3.602424 |
| std | 3.092654 |
| min | -5.421891 |
| 25% | 1.301328 |
| 50% | 3.040259 |
| 75% | 5.619022 |
| max | 20.348324 |
df01_summary_info=df01.groupby(['x'], dropna=False).\
aggregate(value_rows=('value', 'size'),
value_nonmiss=('value', 'count'),
value_min=('value', 'min'),
value_mean=('value', 'mean'),
value_max=('value', 'max'),
value_sem=('value', 'sem'),
value_unique=('value','nunique')
).\
reset_index()
df01_summary_info
| | x | value_rows | value_nonmiss | value_min | value_mean | value_max | value_sem | value_unique |
|---|---|---|---|---|---|---|---|---|
| 0 | A | 700 | 700 | -1.082039 | 4.489812 | 15.105555 | 0.130043 | 700 |
| 1 | B | 700 | 700 | -5.421891 | 3.447927 | 12.231985 | 0.111557 | 700 |
| 2 | C | 700 | 700 | 0.025424 | 3.263984 | 20.348324 | 0.111404 | 700 |
| 3 | D | 700 | 700 | -1.545187 | 3.207971 | 8.046591 | 0.106765 | 700 |
We have 2800 distinct numeric values divided equally among 4 categories: A, B, C, and D. No column has any missing values. The lowest value, -5.421891, occurs in group B and the highest value, 20.348324, occurs in group C. Group A has the largest standard error of the mean (0.130).
sns.catplot(data=df01, x='x', kind='count')
plt.show()
The observations are equally distributed across the categories of x. The categories are balanced.
sns.displot(data=df01,x='value', kind='hist')
<seaborn.axisgrid.FacetGrid at 0x10756c130>
The marginal distribution is roughly bell-shaped, but a little skewed to the right.
You will now explore the categorical-to-continuous relationship between the non-numeric column and numeric column in df01.
Create a BOX PLOT using Seaborn to visualize the summary statistics of the numeric column GIVEN the non-numeric column.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the BOX PLOT?
sns.catplot(data=df01, x='x', y='value', kind='box')
plt.show()
The medians vary across categories. A (the highest, as we saw in the earlier summary) and D have high variability, and C is the least variable. C also has the most extreme values (outliers).
sns.catplot(data=df01, x='x', y='value', kind='point', join=False)
plt.show()
The 95% confidence interval on the mean for A is clearly separated from the rest. We can conclude with 95% confidence that the average for A is different from, and higher than, the others. For B, C, and D the sample means decrease in that order, but we are not confident that the averages differ in general.
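The separation described above can be checked numerically from the group means and standard errors already displayed in `df01_summary_info`. The following is a minimal sketch under stated assumptions: the values are copied from that table, the multiplier 1.96 is a normal approximation to a 95% interval (Seaborn may compute its error bars differently, for example by bootstrapping), and the names `summary`, `ci_low`, `ci_high`, and `a_is_separated` are illustrative only.

```python
import pandas as pd

# Group means and standard errors copied from df01_summary_info above.
summary = pd.DataFrame(
    {
        "x": ["A", "B", "C", "D"],
        "value_mean": [4.489812, 3.447927, 3.263984, 3.207971],
        "value_sem": [0.130043, 0.111557, 0.111404, 0.106765],
    }
).set_index("x")

# Assumption: normal-approximation multiplier for a 95% interval.
z = 1.96
summary["ci_low"] = summary["value_mean"] - z * summary["value_sem"]
summary["ci_high"] = summary["value_mean"] + z * summary["value_sem"]

# A is separated if its lower bound exceeds every other group's upper bound.
others_high = summary.loc[["B", "C", "D"], "ci_high"].max()
a_is_separated = bool(summary.loc["A", "ci_low"] > others_high)

print(summary.round(3))
print("A separated from B, C, D:", a_is_separated)
```

Under this approximation the intervals for B, C, and D overlap one another, which is consistent with the reading of the point plot.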
sns.catplot(data=df01, x='x', y='value', kind='violin', inner='quartile')
plt.show()
Yes, the conditional distributions are different. In B and D the distribution is symmetric, whereas in A and C the distribution is right skewed.
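A visual judgment of symmetric versus right skewed can be backed up with the sample skewness per category. This is only a sketch on synthetic data, not on `df01`: the frame `toy`, its group labels, and the chosen distributions are all hypothetical stand-ins, picked so that one group is symmetric and the other is right skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for df01: one symmetric group and one right-skewed group.
toy = pd.DataFrame(
    {
        "x": ["sym"] * n + ["skew"] * n,
        "value": np.concatenate(
            [rng.normal(3.0, 2.0, n), rng.exponential(3.0, n)]
        ),
    }
)

# Sample skewness per category: near 0 when symmetric, positive when right skewed.
skew_by_group = toy.groupby("x")["value"].skew()
print(skew_by_group)
```

The same `groupby(...)["value"].skew()` call could be applied to `df01` grouped by `x` to see whether the numbers agree with the violin plot.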
Create a CONDITIONAL KDE plot using Seaborn to show the conditional density of the numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the KDE color.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the CONDITIONAL KDE plot?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
sns.displot(data=df01, x='value',hue='x', kind='kde', common_norm=False)
plt.show()
The KDE plots support our violin plot conclusions. The conditional distributions are different. In B and D the distribution is symmetric, whereas in A and C the distribution is right skewed. Here we can additionally observe that C is even more concentrated around 0 than A.
The hue argument assigns colors corresponding to the different categories.
common_norm=False removes the sample size effect. In this data set every category has the same number of observations (700), so setting common_norm does not change the relative heights of the curves.
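The sample size effect is easier to see when the groups are unbalanced. The sketch below uses NumPy histograms as a stand-in for the KDE, so it is an analogy rather than what Seaborn does internally: normalizing each group separately corresponds to `common_norm=False`, and scaling each group by its share of the rows corresponds to `common_norm=True`. The arrays `big` and `small` are synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, deliberately unbalanced groups (unlike df01, where each has 700 rows).
big = rng.normal(0.0, 1.0, 900)
small = rng.normal(4.0, 1.0, 100)
edges = np.linspace(-5.0, 9.0, 57)
width = np.diff(edges)


def area(density):
    """Integrate a binned density over the histogram bins."""
    return float(np.sum(density * width))


# Separate normalization (analogous to common_norm=False): each area is 1.
big_sep, _ = np.histogram(big, bins=edges, density=True)
small_sep, _ = np.histogram(small, bins=edges, density=True)

# Joint normalization (analogous to common_norm=True): areas equal group shares.
total = big.size + small.size
big_joint = big_sep * big.size / total
small_joint = small_sep * small.size / total

print(area(big_sep), area(small_sep))
print(area(big_joint), area(small_joint))
```

With joint normalization the small group is visually flattened even though its shape is unchanged, which is the effect the hint refers to.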
Create a FACETED HISTOGRAM plot using Seaborn to show the conditional histogram of the numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the COLUMN FACETS. The x and y scales of the facets must be free or not-shared across the facets.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the FACETED HISTOGRAM?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?
sns.displot(data=df01, x='value',col='x', kind='hist',\
col_wrap=2, facet_kws={'sharex': False, 'sharey': False})
plt.show()
The conditional distributions are different. B has a symmetric, more or less normal distribution and D has a symmetric, more or less uniform distribution, whereas A and C are right skewed. C is the most strongly skewed.
The col argument of Seaborn's displot allows us to ASSOCIATE or link the COLUMN FACET to a column in the data.
You have explored the CONDITIONAL DISTRIBUTIONS of the numeric column GIVEN the non-numeric column.
Which plot types made it easy to COMPARE summary statistics across the categories?
Which plot types made it easy to COMPARE the distributional SHAPE across the categories?
I feel the violin plot made it easy to COMPARE summary statistics across the categories, and the KDE plot made it easier to COMPARE the distributional SHAPE across the categories.
df02=pd.read_csv("hw07_prob_02.csv")
df02
| | x1 | x2 | m1 |
|---|---|---|---|
| 0 | -1.244161 | 1.278363 | A |
| 1 | -1.180081 | 1.198855 | A |
| 2 | -0.658225 | 0.713579 | A |
| 3 | 0.390327 | -0.487877 | A |
| 4 | -0.015277 | -0.049614 | A |
| ... | ... | ... | ... |
| 895 | 1.221421 | 1.475893 | I |
| 896 | 0.353086 | 0.487289 | I |
| 897 | 0.090534 | 0.121304 | I |
| 898 | -0.826389 | -0.983525 | I |
| 899 | -0.428498 | -0.652316 | I |
900 rows × 3 columns
df02.shape
(900, 3)
df02.dtypes
x1 float64 x2 float64 m1 object dtype: object
df02.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 900 entries, 0 to 899 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x1 900 non-null float64 1 x2 900 non-null float64 2 m1 900 non-null object dtypes: float64(2), object(1) memory usage: 21.2+ KB
df02.isna().sum()
x1 0 x2 0 m1 0 dtype: int64
df02.nunique()
x1 900 x2 900 m1 9 dtype: int64
def count_missing(series):
    # Number of missing values in the series
    return series.isna().sum()
df02_summary_info = df02.groupby(['m1'], dropna=False).\
aggregate(x1_rows=('x1', 'size'),
x1_nonmiss=('x1', 'count'),
x1_miss=('x1',count_missing),
x1_min=('x1', 'min'),
x1_mean=('x1', 'mean'),
x1_max=('x1', 'max'),
x1_sem=('x1', 'sem'),
x1_unique=('x1', 'nunique'),
x2_rows=('x2', 'size'),
x2_nonmiss=('x2', 'count'),
x2_miss=('x2',count_missing),
x2_min=('x2', 'min'),
x2_mean=('x2', 'mean'),
x2_max=('x2', 'max'),
x2_sem=('x2', 'sem'),
x2_unique=('x2', 'nunique')
).\
reset_index()
df02_summary_info
| | m1 | x1_rows | x1_nonmiss | x1_miss | x1_min | x1_mean | x1_max | x1_sem | x1_unique | x2_rows | x2_nonmiss | x2_miss | x2_min | x2_mean | x2_max | x2_sem | x2_unique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 100 | 100 | 0 | -2.361043 | 0.091667 | 2.970587 | 0.102270 | 100 | 100 | 100 | 0 | -2.940527 | -0.073645 | 2.374314 | 0.101130 | 100 |
| 1 | B | 100 | 100 | 0 | -2.202832 | 0.083659 | 2.794577 | 0.104235 | 100 | 100 | 100 | 0 | -2.250538 | -0.052223 | 2.416286 | 0.098140 | 100 |
| 2 | C | 100 | 100 | 0 | -3.435065 | -0.079557 | 2.281274 | 0.098465 | 100 | 100 | 100 | 0 | -2.648666 | 0.040023 | 2.726775 | 0.094518 | 100 |
| 3 | D | 100 | 100 | 0 | -2.310966 | -0.075117 | 2.585018 | 0.109554 | 100 | 100 | 100 | 0 | -2.479288 | 0.179923 | 2.527816 | 0.103563 | 100 |
| 4 | E | 100 | 100 | 0 | -2.701852 | 0.015560 | 2.465939 | 0.095035 | 100 | 100 | 100 | 0 | -2.263450 | -0.123164 | 2.564615 | 0.098234 | 100 |
| 5 | F | 100 | 100 | 0 | -2.572121 | -0.127923 | 2.707330 | 0.109105 | 100 | 100 | 100 | 0 | -2.511288 | 0.052244 | 2.536549 | 0.118507 | 100 |
| 6 | G | 100 | 100 | 0 | -2.121500 | -0.039613 | 3.068722 | 0.105044 | 100 | 100 | 100 | 0 | -2.507380 | 0.038718 | 2.886683 | 0.111675 | 100 |
| 7 | H | 100 | 100 | 0 | -2.728041 | 0.041222 | 2.996371 | 0.105815 | 100 | 100 | 100 | 0 | -2.326055 | 0.033702 | 2.727422 | 0.100438 | 100 |
| 8 | I | 100 | 100 | 0 | -2.329354 | 0.265349 | 2.860297 | 0.103347 | 100 | 100 | 100 | 0 | -2.168624 | 0.259289 | 2.753163 | 0.103595 | 100 |
We have 900 rows divided equally among 9 categories, A to I. No column has any missing values. The lowest x1 value, -3.44, occurs in group C and the highest x1 value, 3.07, occurs in group G. The x1 mean is largest in group I, and the standard error of the x1 mean is largest in group D. The same summary could be written for x2, but I will leave that to the visual exploration.
sns.catplot(data=df02, x='m1', kind='count')
plt.show()
The observations are equally distributed across the categories of m1. The categories are balanced.
Create HISTOGRAMS using Seaborn to visualize the marginal distributions of the continuous variables in df02.
You may create separate figures for each histogram based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate histograms.
Do the marginal distributions appear symmetric?
df02.hist()
plt.show()
sns.displot(data=df02, kind='hist', aspect=2)
plt.show()
Yes, the marginal distributions appear symmetric.
Create CONDITIONAL KDE plots using Seaborn to show the conditional densities of each numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the KDE color.
You may create separate figures for each histogram based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate histograms.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the CONDITIONAL KDE plot?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
sns.displot(data=df02, x='x1',hue='m1', kind='kde', common_norm=False)
plt.show()
sns.displot(data=df02, x='x2',hue='m1', kind='kde', common_norm=False)
plt.show()
The CONDITIONAL DISTRIBUTIONS appear more or less the same in the CONDITIONAL KDE plots: each one is a bell-shaped curve. (To avoid confusion: whenever I say "normal distribution" I usually mean a bell-shaped curve, since we do not yet know whether the data follow a normal distribution, a t distribution, or a chi-square distribution with suitable degrees of freedom.)
The hue argument assigns colors corresponding to the different categories. common_norm=False removes the sample size effect.
Create BOX PLOTS using Seaborn to visualize the summary statistics of the numeric columns GIVEN the non-numeric column.
You may create separate figures for each boxplot based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate boxplots.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the BOX PLOT?
sns.catplot(data=df02, x='m1', y='x1', kind='box')
plt.show()
sns.catplot(data=df02, x='m1', y='x2', kind='box')
plt.show()
The medians vary across categories. D, F, G, and I have high variability and H is the least variable, for both x1 and x2.
sns.relplot(data=df02, x= 'x1', y='x2')
plt.show()
I can't see a clear relationship here.
Let's now check if the continuous variable relationship depends on the non-numeric variable.
Create a scatter plot between the continuous variables using Seaborn. Color the markers based on the non-numeric column to study if the relationship CHANGES across the categories.
Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
sns.relplot(data=df02, x= 'x1', y='x2', hue='m1')
plt.show()
sns.relplot(data=df02, x= 'x1', y='x2', col='m1',hue='m1', col_wrap=3)
plt.show()
Drawing conclusions from color alone was difficult because m1 has many categories, so I also separated the categories into column facets. We can see a positive linear trend between x1 and x2 for m1 = G, H, I (the correlation value appears to increase from G to I), a negative linear trend between x1 and x2 for m1 = A, B, C (the correlation value appears to increase from A to C), and no clear relationship for m1 = D, E, F.
The hue argument allows you to ASSOCIATE or link the color to a column in the data; the col argument additionally places each category in its own facet.
Let's include a TREND line within the scatter plot to help visualize the linear relationship between the two continuous variables. Let's begin by IGNORING the potential influence of the non-numeric column.
Create a scatter plot which includes a trend line to show the linear relationship between the two numeric columns. You should NOT color based on the non-numeric column.
What kind of relationship does the TREND line represent when the non-numeric column is ignored?
sns.lmplot(data=df02, x= 'x1', y='x2')
plt.show()
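The trend line is almost parallel to the x axis, with a very slight positive slope. When the non-numeric column is ignored there is essentially no linear relationship.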
Let's now include TREND lines that are associated with the categories of the non-numeric column.
Create a scatter plot which includes trend lines to show the linear relationship between the numeric columns. Color the markers and the trend lines based on the non-numeric column.
Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
sns.lmplot(data=df02, x= 'x1', y='x2',hue='m1')
plt.show()
We can see a positive linear trend between x1 and x2 for m1 = F, G, H, I and a negative linear trend between x1 and x2 for m1 = A, B, C, D, E. Within each set the correlation value appears to increase from the first category to the last.
The hue argument allows us to ASSOCIATE or link the color to a column in the data.
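The contrast between a nearly flat pooled trend line and clearly sloped per-category trend lines can be reproduced with a small least-squares fit. This is a sketch on synthetic data, not on `df02`: the frame `toy`, the group labels `neg` and `pos`, and the helper `fitted_slope` are hypothetical, and `np.polyfit` is used here as a plain degree-1 fit rather than as a claim about how `lmplot` fits its lines.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500

# Synthetic stand-in for df02: two groups with opposite linear trends.
x_neg = rng.normal(0.0, 1.0, n)
x_pos = rng.normal(0.0, 1.0, n)
toy = pd.DataFrame(
    {
        "m1": ["neg"] * n + ["pos"] * n,
        "x1": np.concatenate([x_neg, x_pos]),
        "x2": np.concatenate(
            [-x_neg + rng.normal(0.0, 0.3, n), x_pos + rng.normal(0.0, 0.3, n)]
        ),
    }
)


def fitted_slope(frame):
    """Slope of a degree-1 least-squares fit of x2 on x1."""
    slope, _intercept = np.polyfit(frame["x1"], frame["x2"], deg=1)
    return float(slope)


pooled_slope = fitted_slope(toy)
group_slopes = {name: fitted_slope(g) for name, g in toy.groupby("m1")}

print("pooled:", round(pooled_slope, 3))
print("by group:", {k: round(v, 3) for k, v in group_slopes.items()})
```

When opposite trends are pooled they largely cancel, which is why ignoring the grouping variable can hide strong conditional relationships.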
Lastly, let's FACET by the non-numeric column!
Create a scatter plot which includes trend lines to show the linear relationship between the numeric columns. Color the markers and trend lines and FACET based on the non-numeric column. The color and facets are therefore associated with the SAME variable.
The facets should have 3 columns per row.
Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?
sns.lmplot(data=df02, x= 'x1', y='x2', col='m1',hue='m1', col_wrap=3)
plt.show()
We can see a positive linear trend between x1 and x2 for m1 = F, G, H, I and a negative linear trend between x1 and x2 for m1 = A, B, C, D, E. Within each set the correlation value appears to increase from the first category to the last.
The col argument allows us to ASSOCIATE or link the COLUMN FACET to a column in the data, and the hue argument links the color to a column.
You will continue working with the data from Problem 02 to explore the relationship between the two continuous variables.
Linear relationships can be summarized by calculating the correlation coefficient between the numeric columns. The correlation coefficients can be visualized as correlation plots via heat maps. However, let's first practice calculating the correlation matrix between the two numeric columns in df02.
Display the correlation matrix for the numeric columns in df02 to the screen. You do NOT need to assign the correlation matrix to an object.
df02.corr(numeric_only=True)
| | x1 | x2 |
|---|---|---|
| x1 | 1.000000 | 0.021982 |
| x2 | 0.021982 | 1.000000 |
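For reference, the off-diagonal entry of such a matrix is the Pearson correlation coefficient, which can be computed by hand from centered columns. The sketch below uses a synthetic frame `toy` (hypothetical, not `df02`) and compares the manual value with the one returned by `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic stand-in for two numeric columns.
toy = pd.DataFrame({"x1": rng.normal(size=300)})
toy["x2"] = 0.5 * toy["x1"] + rng.normal(size=300)

# Pearson r: summed cross-deviations divided by the square root of the
# product of the summed squared deviations.
x1c = toy["x1"] - toy["x1"].mean()
x2c = toy["x2"] - toy["x2"].mean()
r_manual = float((x1c * x2c).sum() / np.sqrt((x1c**2).sum() * (x2c**2).sum()))

r_pandas = float(toy.corr(numeric_only=True).loc["x1", "x2"])
print(r_manual, r_pandas)
```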
Let's now VISUALIZE the correlation plot as a heat map!
Create a correlation plot between the numeric columns in df02. The correlation plot must be created using Seaborn. You must use a DIVERGING color palette with the correct bounds and midpoint. The correlation plot must be annotated.
You must ignore the non-numeric column for this correlation plot.
fig, ax = plt.subplots()
sns.heatmap(data = df02.corr(numeric_only=True),
vmin=-1, vmax=1, center = 0,
cmap='coolwarm',
annot=True, annot_kws={'size': 25},
ax=ax)
plt.show()
Let's now examine if the correlation plot CHANGES across the categories of the non-numeric column. However, let's practice calculating the grouped correlation matrix BEFORE visualizing the correlation plot.
Display the grouped correlation matrix for the numeric columns in df02 to the screen. You must group by the non-numeric column. You do NOT need to assign the correlation matrix to an object.
the_groups = df02.m1.unique().tolist()
corr_per_group = df02.groupby(['m1']).corr()
corr_per_group
| m1 | | x1 | x2 |
|---|---|---|---|
| A | x1 | 1.000000 | -0.991282 |
| | x2 | -0.991282 | 1.000000 |
| B | x1 | 1.000000 | -0.880486 |
| | x2 | -0.880486 | 1.000000 |
| C | x1 | 1.000000 | -0.722998 |
| | x2 | -0.722998 | 1.000000 |
| D | x1 | 1.000000 | -0.395593 |
| | x2 | -0.395593 | 1.000000 |
| E | x1 | 1.000000 | -0.059890 |
| | x2 | -0.059890 | 1.000000 |
| F | x1 | 1.000000 | 0.270515 |
| | x2 | 0.270515 | 1.000000 |
| G | x1 | 1.000000 | 0.785730 |
| | x2 | 0.785730 | 1.000000 |
| H | x1 | 1.000000 | 0.902762 |
| | x2 | 0.902762 | 1.000000 |
| I | x1 | 1.000000 | 0.992068 |
| | x2 | 0.992068 | 1.000000 |
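With only two numeric columns, each 2x2 block in the grouped matrix carries a single informative number. One possible way to pull that number out per category is to select the `x1` rows from the second index level and read the `x2` column. The sketch below does this on a synthetic frame `toy` (hypothetical, with made-up group labels), since the result of `groupby(...).corr()` has the same two-level row index as the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 300

# Synthetic stand-in for df02 with a negative-trend and a positive-trend group.
x1 = rng.normal(size=2 * n)
sign = np.repeat([-1.0, 1.0], n)
toy = pd.DataFrame(
    {
        "m1": np.repeat(["neg", "pos"], n),
        "x1": x1,
        "x2": sign * x1 + rng.normal(scale=0.5, size=2 * n),
    }
)

corr_long = toy.groupby("m1")[["x1", "x2"]].corr()

# Keep only the x1 row of each 2x2 block, then read off the x2 column.
x1_x2_corr = corr_long.xs("x1", level=1)["x2"]
print(x1_x2_corr)
```

The result is one correlation per category, which is easier to scan than the stacked blocks when there are many groups.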
Let's now VISUALIZE the grouped correlation plot!
Create a grouped correlation plot between the numeric columns in df02. You must group by the non-numeric column. The separate categories on the non-numeric column must be associated with separate subplots. The subplot title must be specified correctly to make it clear which subplot is associated with which value of the non-numeric column. You must use a DIVERGING color palette with the correct bounds and midpoint. The correlation plot must be annotated.
fig, axs = plt.subplots(1, len(the_groups), figsize=(18, 6), sharex=True, sharey=True )
for ix in range(len(the_groups)):
    # One annotated heat map per category of m1
    sns.heatmap( data = corr_per_group.loc[ the_groups[ ix ] ],
                 vmin=-1, vmax=1, center = 0,
                 cmap='coolwarm', cbar=False,
                 annot=True, annot_kws={'size': 20},
                 ax=axs[ix] )
    axs[ ix ].set_title('m1: %s' % the_groups[ ix ] )
plt.show()
You have visualized the distributions and relationship between the continuous variables in df02 several ways. Let's conclude by working with a plot type that combines both aspects into a single graphic.
Create a PAIRS PLOT to show the marginal histograms and scatter plot between the numeric columns in df02. You must ignore the non-numeric column.
sns.pairplot(data = df02)
plt.show()
CONDITIONAL DISTRIBUTIONS and CONDITIONAL RELATIONSHIPS can be shown within a PAIRS PLOT. The non-numeric column can be associated with COLOR which creates separate colored CONDITIONAL DISTRIBUTIONS and separate colored MARKERS within the SCATTER PLOTS. You must COLOR the PAIRS PLOT by the non-numeric column.
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
sns.pairplot(data = df02, hue='m1',
             diag_kws={'common_norm': False})
plt.show()
You have visually explored the relationship between the numeric columns many different ways. You ignored the non-numeric column, as well as examined if the relationship CHANGED across the categories of the non-numeric column.
Which plot type did you feel was the easiest for identifying if the relationship changed across the categories of the non-numeric column?
Here m1 has 9 categories, which is fairly many, so I feel the correlation plot or the lmplot separated into different columns or rows is better for identifying the relationships.
df04=pd.read_csv("hw07_prob_04.csv")
df04.head()
| | x01 | x02 | x03 | x04 | x05 | x06 | x07 | x08 | x09 | x10 | x11 | x12 | v |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.264427 | 1.832210 | 6.131579 | 0.778511 | 1.536615 | 0.553812 | 1.301533 | 0.754062 | -2.869053 | 1.461101 | 0.665066 | 0.898219 | A1 |
| 1 | 1.192453 | 0.516253 | 5.152383 | -0.270222 | 0.959529 | 0.548572 | 0.397379 | 1.089376 | -2.918729 | 0.495790 | 0.769892 | 0.408373 | A1 |
| 2 | 0.687623 | 0.842107 | 4.688000 | 0.844798 | -0.121031 | 0.877528 | 0.153460 | 1.711498 | -3.579991 | 1.437499 | 1.126963 | 0.176703 | A1 |
| 3 | -0.440204 | -0.711678 | 4.073436 | -0.067530 | -0.140558 | -0.327413 | -0.752740 | -0.563624 | -5.244084 | -0.559143 | 0.034237 | -0.055663 | A1 |
| 4 | -0.017212 | 0.166355 | 5.063158 | 0.241118 | -0.145682 | -0.030299 | -0.040292 | -0.002534 | -4.329521 | 0.520308 | -0.306240 | -0.149979 | A1 |
df04.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 633 entries, 0 to 632 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x01 633 non-null float64 1 x02 633 non-null float64 2 x03 633 non-null float64 3 x04 633 non-null float64 4 x05 633 non-null float64 5 x06 633 non-null float64 6 x07 633 non-null float64 7 x08 633 non-null float64 8 x09 633 non-null float64 9 x10 633 non-null float64 10 x11 633 non-null float64 11 x12 633 non-null float64 12 v 633 non-null object dtypes: float64(12), object(1) memory usage: 64.4+ KB
df04.shape
(633, 13)
df04.nunique()
x01 633 x02 633 x03 633 x04 633 x05 633 x06 633 x07 633 x08 633 x09 633 x10 633 x11 633 x12 633 v 3 dtype: int64
df04.isna().sum()
x01 0 x02 0 x03 0 x04 0 x05 0 x06 0 x07 0 x08 0 x09 0 x10 0 x11 0 x12 0 v 0 dtype: int64
df04.v.value_counts()
A1 211 B2 211 C3 211 Name: v, dtype: int64
We have 633 rows divided equally among 3 categories: A1, B2, and C3. No column has any missing values.
sns.catplot(data=df04, x='v', kind='count')
plt.show()
All categories have equal count of 211.
It is best to study the marginal distributions and then conditional distributions associated with continuous variables (numeric columns) BEFORE exploring the relationships between them. However, we will modify the typical EDA workflow for this problem. Let's jump to using the PAIRS PLOT which allows exploring distributions and relationships within a single graphic. We will revisit the distributions in more detail later.
Create a PAIRS PLOT associated with all numeric columns in df04 using Seaborn.
What does this specific PAIRS PLOT reveal about the variables and their relationships?
sns.pairplot(data = df04)
plt.show()
The main diagonal of the grid contains the histogram of each variable. Most of the histograms are bell-shaped, except for x03 and x09. The off-diagonal cells of the grid contain scatter plots of one variable against another, which show how two different variables relate to each other. In most cases we cannot see any particular relationship. We can see a linear relationship in, for example, x02 vs x04 and x01 vs x05.
Let's now examine if the non-numeric column impacts the continuous variables. Create a PAIRS PLOT for the numeric columns and COLOR based on the non-numeric column using Seaborn.
What does this specific grouped PAIRS PLOT reveal about the impact of the non-numeric column on the continuous variables?
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
sns.pairplot(data = df04, hue='v',
diag_kws={'common_norm': False})
plt.show()
The main diagonal of the grid now contains the conditional distribution of each variable for each category. Initially we saw mostly bell-shaped histograms, except for x03 and x09. After grouping by the categorical column, x03 and x09 each split into 3 distinct bell-shaped curves. The off-diagonal cells of the grid contain scatter plots of one variable against another. In most cases we still cannot see any particular relationship. Initially we saw a linear relationship in, for example, x02 vs x04 and x01 vs x05. Now we also start seeing linear relationships within categories, e.g., x07 vs x09. In some cases, such as x10 vs x11, it is still hard to tell whether there is a relationship among the blue points because the colors overlap, but we can see a relationship in the orange and green points.
Let's now summarize the linear relationships between numeric columns using a CORRELATION PLOT. You do NOT need to display the correlation matrix first this time. Instead, we will jump straight to visualizing the CORRELATION PLOT.
Create a correlation plot between the numeric columns in df04. The correlation plot must be created using Seaborn. You must use a DIVERGING color palette with the correct bounds and midpoint.
Do you feel this correlation plot needs to be annotated? Try annotating the correlation plot and then NOT annotating it. Are you able to reach the same conclusions without the annotated text?
You must ignore the non-numeric column for this correlation plot.
fig, ax = plt.subplots(figsize=(18,18))
sns.heatmap(data = df04.corr(numeric_only=True),
vmin=-1, vmax=1, center = 0,
cmap='coolwarm',
annot=True, annot_kws={'size':10},
cbar=False,
ax=ax)
plt.show()
fig, ax = plt.subplots(figsize=(18,18))
sns.heatmap(data = df04.corr(numeric_only=True),
vmin=-1, vmax=1, center = 0,
cmap='coolwarm',
annot=False, annot_kws={'size':10},
cbar=True,
ax=ax)
plt.show()
I feel the annotated version is better. Without annotation the colorbar is definitely needed, and even then the values are difficult to read.
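When a heat map has many cells, a ranked list of the strongest pairs can complement the annotations. The following is a sketch on a synthetic frame `toy` (hypothetical column names `a` to `d`), not on `df04`: it keeps the strict upper triangle of the correlation matrix so each pair appears once, then sorts by absolute value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Synthetic stand-in for a few numeric columns; "b" is built to track "a".
toy = pd.DataFrame(rng.normal(size=(400, 4)), columns=["a", "b", "c", "d"])
toy["b"] = toy["a"] + 0.3 * rng.normal(size=400)

corr = toy.corr(numeric_only=True)

# Strict upper triangle: each pair once, diagonal excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Strongest correlations first, regardless of sign.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head())
```

Applied to `df04.corr(numeric_only=True)`, the same steps would list the variable pairs that stand out in the heat map, without having to read values off the colors.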
Let's now group the correlation plot by the non-numeric column.
Create a grouped correlation plot between the numeric columns in df04. You must group by the non-numeric column. The separate categories on the non-numeric column must be associated with separate subplots. The subplot title must be specified correctly to make it clear which subplot is associated with which value of the non-numeric column. You must use a DIVERGING color palette with the correct bounds and midpoint.
Do you feel this correlation plot needs to be annotated? Try annotating the correlation plot and then NOT annotating it. Are you able to reach the same conclusions without the annotated text?
the_groups1 = df04.v.unique().tolist()
corr_per_group1 = df04.groupby(['v']).corr()
fig, axs = plt.subplots(len(the_groups1),1, figsize=(18,50) )
for ix in range(len(the_groups1)):
    # One annotated heat map per category of v
    sns.heatmap( data = corr_per_group1.loc[ the_groups1[ ix ] ],
                 vmin=-1, vmax=1, center = 0,
                 cmap='coolwarm', cbar=False,
                 annot=True, annot_kws={'size':15 },
                 ax=axs[ix] )
    # Title from the df04 categories (the_groups1), not the df02 categories
    axs[ ix ].set_title('v: %s' % the_groups1[ ix ] )
plt.show()
fig, axs = plt.subplots(len(the_groups1),1, figsize=(18,50) )
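for ix in range(len(the_groups1)):
    # Same heat maps without annotation, so the colorbar is switched on
    sns.heatmap( data = corr_per_group1.loc[ the_groups1[ ix ] ],
                 vmin=-1, vmax=1, center = 0,
                 cmap='coolwarm', cbar=True,
                 annot=False, annot_kws={'size':15 },
                 ax=axs[ix] )
    # Title from the df04 categories (the_groups1), not the df02 categories
    axs[ ix ].set_title('v: %s' % the_groups1[ ix ] )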
plt.show()
I feel the annotated version is better. Without annotation the colorbar is definitely needed, and even then the values are difficult to read.
What were the pros and cons of exploring the RELATIONSHIPS between numeric columns with a PAIRS PLOT for this data set?
What were the pros and cons of exploring the LINEAR relationships between the numeric columns with CORRELATION PLOTS for this data set?
Pair plot:
Pros
Cons
Correlation plot:
Pros
Cons
Let's now return to explore the continuous variable distributions in depth for df04. You have seen that there are more than just a few continuous variables in this data set! It might seem like we need to perform a lot of tedious actions to explore all of the variables. But, you do NOT need to manually create all figures! You do NOT need to resort to for-loops either! Instead, the data can be RESHAPED from the current WIDE-FORMAT to LONG-FORMAT. This allows associating Seaborn's FACETS with the continuous variables!
First, display the number of rows and columns in df04 as a reminder.
df04.shape
(633, 13)
Reshape the df04 WIDE-FORMAT DataFrame into LONG-FORMAT. The numeric columns of df04 MUST be "gathered up" or STACKED on top of each other. The non-numeric column must NOT be gathered up. You MUST include a column named rowid that corresponds to the row index. The rowid column must NOT be gathered up with the other numeric columns.
Assign the LONG-FORMAT data set to the lf04 object.
Display the .info() method for the LONG-FORMAT object to the screen.
df04_features = df04.select_dtypes('number').copy()
df04_features.melt(ignore_index=False)
lf04 = df04_features.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['rowid'])
lf04.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7596 entries, 0 to 7595 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 rowid 7596 non-null int64 1 variable 7596 non-null object 2 value 7596 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 178.2+ KB
lf04.shape
(7596, 3)
We have 7596 rows. Yes, it makes sense as we have 633 rows for 12 columns hence, $633 \times 12=7596$.
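The row-count arithmetic can be seen on a tiny example. The sketch below builds a hypothetical three-row wide table (`wide`, with made-up values), keeps `rowid` and the label column as identifiers, and gathers the two numeric columns, so the long table should have 3 x 2 = 6 rows.

```python
import pandas as pd

# Tiny hypothetical wide-format table: 3 rows, 2 numeric columns, 1 label column.
wide = pd.DataFrame(
    {"x01": [0.1, 0.2, 0.3], "x02": [1.0, 2.0, 3.0], "v": ["A1", "B2", "C3"]}
)

long = (
    wide.reset_index()
    .rename(columns={"index": "rowid"})
    .melt(id_vars=["rowid", "v"], value_vars=["x01", "x02"])
)

# Rows in long format = rows in wide format x number of gathered columns.
expected_rows = wide.shape[0] * 2
print(long.shape, expected_rows)
```

The identifier columns are repeated once per gathered column, which is why the long table grows in rows but not in information.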
You can now use the LONG-FORMAT data to visually explore the numeric columns in df04!
Visualize the marginal distributions for each numeric variable in df04 using the LONG-FORMAT lf04 object and Seaborn. You must associate the correct newly created "gathered" value column with the x axis argument. You must associate the column facets with the correct newly created "gathered" variable column. You must use 21 bins to create the histograms. The figure should have 4 facets per row. The x and y scales of the facets must be free or not-shared across the facets.
How would you describe the SHAPES of the continuous variable distributions?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?
df04_objects = df04.select_dtypes('object').copy()
id_cols = ['rowid'] + df04_objects.columns.to_list()
df04_lf = df04.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df04_features.columns)
sns.displot( data = df04_lf, x='value', col= 'variable', bins=21, aspect=2,col_wrap= 4,
facet_kws= {'sharex': False, 'sharey': False})
plt.show()
The lf04 LONG-FORMAT DataFrame has a separate column for the non-numeric column in df04. Thus, it was NOT "gathered" with the numeric columns. You can therefore use the non-numeric column as a GROUPING variable in the visualizations!
Visualize the CONDITIONAL KDE plots for each numeric variable in df04 within FACETS of a single figure. Each facet must be associated with one of the "original" numeric columns in df04. You must associate the correct newly created "gathered" value column in the x axis argument. You must associate the column facets with the correct newly created "gathered" variable column. You must associate the "original" df04 non-numeric column with the CONDITIONAL KDE color. The figure should have 4 facets per row. The x and y scales of the facets must be free or not-shared across the facets.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT across the categories of the non-numeric column?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
sns.displot(data = df04_lf, x='value', col='variable', kind='kde',
col_wrap=4, hue='v',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False)
plt.show()
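The sample size effect mentioned in the hint can be illustrated numerically. The sketch below uses histograms as a simple stand-in for KDEs and two made-up groups of unequal size (`a` and `b` are hypothetical). It compares the area under each group's curve when normalizing by the total sample size and when normalizing by the group's own sample size, which is the idea behind `common_norm=False`.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical categories with unequal sample sizes.
groups = {'a': rng.normal(0, 1, size=900), 'b': rng.normal(3, 1, size=100)}

edges = np.histogram_bin_edges(np.concatenate(list(groups.values())), bins=30)
widths = np.diff(edges)
n_total = sum(len(vals) for vals in groups.values())

areas_common, areas_separate = {}, {}
for name, vals in groups.items():
    counts, _ = np.histogram(vals, bins=edges)
    # Common normalization: density uses the TOTAL sample size.
    areas_common[name] = np.sum(counts / (n_total * widths) * widths)
    # Separate normalization: density uses the GROUP sample size.
    areas_separate[name] = np.sum(counts / (len(vals) * widths) * widths)

print(areas_common)
print(areas_separate)
```

Under common normalization the smaller group's curve is scaled down by its share of the rows, so differences in height partly reflect sample size instead of shape.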
Yes, several of the conditional distributions differ across the categories. For example, in x3 all three categories are roughly bell-shaped but concentrated in three different regions, and the orange and blue densities show a hint of a second mode. In x9 the three categories are likewise bell-shaped but concentrated in three different regions. In x2 the orange density has two modes, and in x10 the orange density is somewhat flat around its peak.
Although there are multiple conditional distribution plots we should use to fully explore the data, you will conclude this assignment with a BOXPLOT. You will create separate BOXPLOTS for each "original" numeric column within FACETS of a single figure. Each facet must be associated with one of the "original" numeric columns in df04. You must associate the "original" df04 non-numeric column with the x axis argument. You must associate the correct newly created "gathered" value column with the y axis argument. You must associate the column facets with the correct newly created "gathered" variable column.
Experiment with using shared x and y axis scales across the FACETS and NOT SHARING the x and y axis scales. Which approach seems best for this particular data set?
# NOT SHARING the y axis scale across facets (the x axis shows the same categories in every facet)
sns.catplot(data = df04_lf, x='v', y='value', col='variable', col_wrap=4,
kind='box',
sharey=False)
plt.show()
# SHARING the x and y axis scales
sns.catplot(data = df04_lf, x='v', y='value', col='variable', col_wrap=4,
kind='box')
plt.show()
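Not sharing the y axis scale is visually clearer for this data set, since each facet then uses the range of its own variable instead of one range stretched to fit all of them.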
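One way to support the choice between shared and free scales is to compare the range of each gathered variable. The sketch below does this on a made-up long-format frame (`toy_lf` and its values are hypothetical); the column names `v`, `variable`, and `value` follow the long-format layout used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical long-format data: two variables on different scales, one grouping column.
toy_lf = pd.DataFrame({
    'v': np.tile(['a', 'b'], 100),
    'variable': np.repeat(['x1', 'x2'], 100),
    'value': np.concatenate([rng.normal(0, 1, 100), rng.normal(50, 20, 100)]),
})

# Range of each gathered variable; very different spans favor free y axes.
ranges = toy_lf.groupby('variable')['value'].agg(['min', 'max'])
ranges['span'] = ranges['max'] - ranges['min']
print(ranges)
```

If the spans are similar and comparing levels across variables matters, shared scales may be the better choice.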
You must download the data associated with one of the Final Projects from the Canvas site. Save the file(s) in the same directory as this Jupyter notebook. You may use the same project as the previous assignment OR switch to a different project.
Read in the data associated with one of the Final Projects. You previously visually explored MARGINAL behavior. You must now begin to visually explore relationships between variables in the Project data. However, you do NOT need to explore ALL relationships in this assignment.
You MUST create at least 6 plots which explore relationships between variables. Those plots can be categorical-to-categorical relationships (combinations), categorical-to-continuous relationships, and/or continuous-to-continuous relationships. The exact type of plots you should use depends on the project.
However, 2 of the plots MUST involve MORE than 2 variables.
Add as many cells as you feel are necessary.
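As one possible way to involve more than 2 variables in a single plot, a scatter plot can map a third variable to color. The sketch below is only an illustration on synthetic values: the column names `X01`, `X02`, and `Y` mirror the project data, but the numbers are randomly generated and the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(4)
# Synthetic stand-in; only the column names X01, X02, and Y mirror the project data.
toy = pd.DataFrame({
    'X01': rng.normal(size=200),
    'X02': rng.normal(size=200),
    'Y': rng.integers(0, 2, size=200),
})

# Three variables in one figure: x position, y position, and marker color.
g = sns.relplot(data=toy, x='X01', y='X02', hue='Y', kind='scatter')
plt.show()
```

Adding `col=` with another column would bring a fourth variable into the same figure through facets.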
df06 = pd.read_csv('training_data.csv')
df06
| | X01 | X02 | X03 | X04 | X05 | X06 | X07 | X08 | X09 | X10 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.907070 | 1.266914 | -0.332039 | -0.248782 | 0.200432 | -0.008683 | 0.316866 | -0.323924 | 0.030199 | -0.205569 | ... | 0.552065 | 0.489846 | 1.113175 | -0.240931 | -0.108875 | -0.114766 | 0.841632 | 0.104236 | -121.810994 | 1 |
| 1 | -4.608052 | 4.672474 | 0.154697 | 0.268719 | -0.842417 | 0.055191 | 0.622848 | -0.260097 | -0.651079 | 1.096821 | ... | 1.989505 | 1.355984 | 1.656029 | 2.428749 | 1.068637 | 1.945175 | 0.950544 | 0.143290 | -59.362086 | 1 |
| 2 | 4.338816 | 5.684974 | 1.868370 | -1.883006 | 0.589758 | 0.932240 | -0.646026 | 0.183410 | 0.132287 | -0.426386 | ... | 2.128248 | 2.553980 | 2.661607 | 2.625942 | 4.462401 | 3.621299 | 0.992381 | 0.127803 | -79.575912 | 1 |
| 3 | -1.835062 | 0.427501 | -2.226023 | 0.700375 | -1.144850 | 1.188100 | 0.727831 | -0.271734 | 0.003246 | 0.138308 | ... | -0.856860 | -0.766993 | -0.882442 | -0.832196 | -0.377106 | -0.633452 | 0.964183 | 0.088978 | -139.426151 | 0 |
| 4 | 13.990969 | -3.877269 | 1.921605 | 0.162288 | 2.316402 | -0.161137 | -0.099180 | 0.514620 | -0.551956 | -0.517779 | ... | -0.333291 | 0.126656 | -0.557930 | -0.185135 | -0.639549 | -0.245234 | 0.845817 | 0.172305 | -29.433234 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 218 | -3.454719 | -0.911378 | 1.072785 | 0.845129 | 1.817682 | -0.034388 | -0.394277 | -0.260803 | 0.018067 | -0.718457 | ... | -1.041533 | -0.585809 | -0.831599 | 0.315496 | -2.098443 | -0.767479 | 0.941532 | 0.112127 | -100.512718 | 0 |
| 219 | 0.493271 | 2.184699 | 0.107755 | -1.852857 | 0.329977 | 0.679100 | 0.933463 | 0.110013 | 0.234102 | -0.080677 | ... | 1.599908 | 1.485953 | 1.326105 | 0.738920 | 1.782694 | 1.221559 | 0.965555 | 0.098206 | -118.299085 | 1 |
| 220 | -5.009510 | -0.488819 | 0.838883 | 0.615571 | 0.069954 | -0.365945 | -0.290072 | 0.777958 | -0.392741 | 0.126245 | ... | -0.769891 | -0.384573 | -0.894474 | -1.071848 | -0.755456 | -1.001084 | 0.898706 | 0.106729 | -102.237684 | 1 |
| 221 | -2.665672 | -0.546497 | -0.545406 | -0.477273 | 1.476238 | -0.019403 | 0.280312 | 0.359992 | -0.136810 | 0.116397 | ... | -0.092393 | -0.962929 | -0.580874 | -0.750556 | -1.618888 | -1.083649 | 0.964139 | 0.096673 | -119.015441 | 1 |
| 222 | 0.055307 | -1.709225 | -0.763259 | -0.755982 | -0.106586 | -1.174575 | -0.608212 | -0.483285 | -0.055759 | -0.658711 | ... | 0.076693 | 0.434874 | 0.371620 | 0.947893 | 0.775466 | 0.472459 | 0.914727 | 0.137963 | -78.872248 | 1 |
223 rows × 64 columns
df06_features = df06[['X01','X02','Z01','Z02', 'V01', 'V02' ]].copy()
df06_features.melt(ignore_index=False)
lf06 = df06_features.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['rowid'])
lf06.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   rowid     1338 non-null   int64
 1   variable  1338 non-null   object
 2   value     1338 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 31.5+ KB
df06_objects = df06.select_dtypes('object').copy()
id_cols = ['rowid'] + df06_objects.columns.to_list()
df06_lf = df06.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df06_features.columns)
sns.displot(data=df06_lf, x='value', col='variable', bins=21, aspect=2, col_wrap=3, common_bins=False,
            facet_kws={'sharex': False, 'sharey': False})
plt.show()
sns.displot(data = df06_lf, x='value', col='variable', kind='kde',
col_wrap=3,
facet_kws={'sharex': False, 'sharey': False},
common_norm=False)
plt.show()
sns.pairplot(data = df06_features, diag_kws={'common_norm': False})
plt.show()
fig, ax = plt.subplots()
sns.heatmap(data = df06_features.corr(numeric_only=True),
vmin=-1, vmax=1, center = 0,
cmap='coolwarm',
annot=True, annot_kws={'size': 10},
ax=ax)
plt.show()
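The annotated heatmap can be complemented by a ranked list of the strongest pairwise correlations. The sketch below shows one way to extract that list from a correlation matrix. It runs on synthetic data (`toy`, `f1`, `f2`, and `f3` are hypothetical names), where `f2` is deliberately built to track `f1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
base = rng.normal(size=300)
# Synthetic stand-in; f2 is built to correlate strongly with f1, f3 is independent noise.
toy = pd.DataFrame({
    'f1': base,
    'f2': base + rng.normal(scale=0.1, size=300),
    'f3': rng.normal(size=300),
})

corr = toy.corr(numeric_only=True)
# Keep each pair once by masking the diagonal and the lower triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Sort the remaining pairs by absolute correlation, strongest first.
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.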
I used a pair plot and faceted plots of the long-format data, which together contain more than 6 graphs.